In a study on pollution, carried out in 41 cities in the USA, the following variables were considered:
so2 - Sulfur dioxide content of air in micrograms per cubic meter
temp -Average annual temperature (F)
manuf - No. of manufacturing enterprises employing 20 or more workers
pop - Population size (1970 census) in thousands
wind - Average wind speed in miles per hour
precip - Average annual precipitation in inches
days - Average number of days with precipitation per year (annual average number of days of rain)
UPLOADING THE DATA
#RUN DATA
data5 <- read.csv("../data/data_5.csv", header = TRUE)
head(data5)
## city so2 temp manuf pop wind precip days
## 1 Phoenix 10 70.3 213 582 6.0 7.05 36
## 2 Little R 13 61.0 91 132 8.2 48.52 100
## 3 San Fran 12 56.7 453 716 8.7 20.66 67
## 4 Denver 17 51.9 454 515 9.0 12.95 86
## 5 Hartford 56 49.1 412 158 9.0 43.37 127
## 6 Wilmingt 36 54.0 80 80 9.0 40.25 114
CHOOSING THE VARIABLES
All but the city since itâs not numeric.
dim(data5)
## [1] 41 8
data5_variables <- data5[,2:8]
head(data5_variables)
## so2 temp manuf pop wind precip days
## 1 10 70.3 213 582 6.0 7.05 36
## 2 13 61.0 91 132 8.2 48.52 100
## 3 12 56.7 453 716 8.7 20.66 67
## 4 17 51.9 454 515 9.0 12.95 86
## 5 56 49.1 412 158 9.0 43.37 127
## 6 36 54.0 80 80 9.0 40.25 114
summary(data5_variables)
## so2 temp manuf pop
## Min. : 8.00 Min. :43.50 Min. : 35.0 Min. : 71.0
## 1st Qu.: 13.00 1st Qu.:50.60 1st Qu.: 181.0 1st Qu.: 299.0
## Median : 26.00 Median :54.60 Median : 347.0 Median : 515.0
## Mean : 30.05 Mean :55.76 Mean : 463.1 Mean : 608.6
## 3rd Qu.: 35.00 3rd Qu.:59.30 3rd Qu.: 462.0 3rd Qu.: 717.0
## Max. :110.00 Max. :75.50 Max. :3344.0 Max. :3369.0
## wind precip days
## Min. : 6.000 Min. : 7.05 Min. : 36.0
## 1st Qu.: 8.700 1st Qu.:30.96 1st Qu.:103.0
## Median : 9.300 Median :38.74 Median :115.0
## Mean : 9.444 Mean :36.77 Mean :113.9
## 3rd Qu.:10.600 3rd Qu.:43.11 3rd Qu.:128.0
## Max. :12.700 Max. :59.80 Max. :166.0
Since the measures of each column are very different, we have to use a correlation matrix to start the PCA.
CORRELATION MATRIX
## 1st) Determine the correlation matrix
cor_data5 <- cor(data5_variables)
cor_data5
## so2 temp manuf pop wind precip
## so2 1.00000000 -0.43360020 0.64476873 0.49377958 0.09469045 0.05429434
## temp -0.43360020 1.00000000 -0.19004216 -0.06267813 -0.34973963 0.38625342
## manuf 0.64476873 -0.19004216 1.00000000 0.95526935 0.23794683 -0.03241688
## pop 0.49377958 -0.06267813 0.95526935 1.00000000 0.21264375 -0.02611873
## wind 0.09469045 -0.34973963 0.23794683 0.21264375 1.00000000 -0.01299438
## precip 0.05429434 0.38625342 -0.03241688 -0.02611873 -0.01299438 1.00000000
## days 0.36956363 -0.43024212 0.13182930 0.04208319 0.16410559 0.49609671
## days
## so2 0.36956363
## temp -0.43024212
## manuf 0.13182930
## pop 0.04208319
## wind 0.16410559
## precip 0.49609671
## days 1.00000000
OBTAIN THE EIGENVALUES AND EIGENVECTORS
## 2nd) Obtain eigenvectors and eigenvalues
eigen_data5 <- eigen(cor_data5)
eigen_data5
## eigen() decomposition
## $values
## [1] 2.72811968 1.51233485 1.39497299 0.89199129 0.34677866 0.10028759 0.02551493
##
## $vectors
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.4896988171 -0.08457563 -0.0143502 0.40421007 0.7303942 -0.18334573
## [2,] -0.3153706901 0.08863789 -0.6771362 -0.18522794 0.1624652 -0.61066107
## [3,] 0.5411687028 0.22588109 -0.2671591 -0.02627237 -0.1641011 0.04273352
## [4,] 0.4875881115 0.28200380 -0.3448380 -0.11340377 -0.3491048 0.08786327
## [5,] 0.2498749284 -0.05547149 0.3112655 -0.86190131 0.2682549 -0.15005378
## [6,] 0.0001873122 -0.62587937 -0.4920363 -0.18393719 0.1605988 0.55357384
## [7,] 0.2601790729 -0.67796741 0.1095789 0.10976070 -0.4399698 -0.50494668
## [,7]
## [1,] 0.149529278
## [2,] -0.023664113
## [3,] -0.745180920
## [4,] 0.649125507
## [5,] 0.015765377
## [6,] -0.010315309
## [7,] 0.008217393
CRITERIA 1 => ANALYSING WITH KAISERâS CRITERIA
Analyzing the results, we should retain the first 3 PCs because there are 3 eigenvalues that are >1 (Kaiserâs Criteria).
But now we will perform PCA to see if 3 PCs explain the majority of the
CRITERIA 2 => PCA ANALYSIS
# Perfom PCA
pca_data5 <- princomp(data5_variables, cor = TRUE)
print(summary(pca_data5), loadings = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.6517021 1.2297702 1.1810897 0.9444529 0.58887916
## Proportion of Variance 0.3897314 0.2160478 0.1992819 0.1274273 0.04953981
## Cumulative Proportion 0.3897314 0.6057792 0.8050611 0.9324884 0.98202821
## Comp.6 Comp.7
## Standard deviation 0.3166822 0.159733920
## Proportion of Variance 0.0143268 0.003644989
## Cumulative Proportion 0.9963550 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## so2 0.490 0.404 0.730 0.183 0.150
## temp -0.315 0.677 -0.185 0.162 0.611
## manuf 0.541 -0.226 0.267 -0.164 -0.745
## pop 0.488 -0.282 0.345 -0.113 -0.349 0.649
## wind 0.250 -0.311 -0.862 0.268 0.150
## precip 0.626 0.492 -0.184 0.161 -0.554
## days 0.260 0.678 -0.110 0.110 -0.440 0.505
Analyzing the results, we have 81% of the variance explained with the first 3 PCs (viewed in the cumulative proportion). However, if we consider 4 PCs, we will have 93% of the variance explained.
Since this is not conclusive, we should do the third criteria.
CRITERIA 3 => SCREE PLOT
## Calculate total variance explained by each PC
var_explained_data5 = pca_data5$sdev^2 / sum(pca_data5$sdev^2)
## Create scree-plot
library(ggplot2)
qplot(c(1:7), var_explained_data5) +
geom_line() +
xlab("Principal Component") +
ylab("Variance Explained") +
ggtitle("Scree Plot")+
ylim(0,1)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Analyzing the results, in this plot, the elbow point would be between PC2 and PC3. However, if we consider he other criterias we have applied before, it would be best to use 3 PCs, having 81% of the variance explained.
SO LETâS CONSIDER 3 PCs
Now, we need to see which are the variables that contribute more for the explanation of each principal component retained. We need to have a correlation matrix between the variables and the PCs. (Loadings)
# Correlation between original variables and the PCs (loadins)
component_matrix <- cor(data5_variables,pca_data5$scores)
component_matrix
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## so2 0.8088365434 0.10400859 0.01694887 0.38175738 0.43011391
## temp -0.5208984175 -0.10900423 0.79975860 -0.17493906 0.09567234
## manuf 0.8938494595 -0.27778184 0.31553891 -0.02481301 -0.09663572
## pop 0.8053502866 -0.34679989 0.40728458 -0.10710452 -0.20558056
## wind 0.4127189331 0.06821718 -0.36763244 -0.81402520 0.15796972
## precip 0.0003093839 0.76968782 0.58113903 -0.17372001 0.09457328
## days 0.4297383099 0.83374415 -0.12942257 0.10366381 -0.25908903
## Comp.6 Comp.7
## so2 0.05806232 0.023884898
## temp 0.19338547 -0.003779961
## manuf -0.01353294 -0.119030670
## pop -0.02782473 0.103687362
## wind 0.04751936 0.002518265
## precip -0.17530696 -0.001647705
## days 0.15990761 0.001312596
We only need to look into the the 3 first PCs.
Comp 1 => explain so2, manuf and pop
Comp 2 => explain precip and days
Comp 3 => temp
Wind is only explained well with the 4th PC.
Letâs prove that with:
# Apply the rule for the 1st PC
# lambda1 / p
sqrt(eigen_data5$values[1]/8)
## [1] 0.5839649
# Apply the rule for the 2nd PC
# lambda1 / p
sqrt(eigen_data5$values[2]/8)
## [1] 0.4347894
# Apply the rule for the 3rd PC
# lambda1 / p
sqrt(eigen_data5$values[3]/8)
## [1] 0.4175783
GRAPHICAL REPRESENTATION
If we did a graphic just for the first two PCS:
library(devtools)
## Loading required package: usethis
require(ggbiplot)
## Loading required package: ggbiplot
ggbiplot(pca_data5,labels = row.names(data5_variables))
However, we decided to work with the first 3 PCs. So, if we did a graphic for the first 3 PCs:
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Supondo que vocĂȘ tenha feito o PCA:
pca_data5_2 <- prcomp(data5_variables, scale. = TRUE)
pca_scores <- as.data.frame(pca_data5_2$x)
var_explained <- (pca_data5$sdev)^2 / sum((pca_data5$sdev)^2)
pc1_label <- paste0("PC1 (", round(var_explained[1] * 100, 1), "%)")
pc2_label <- paste0("PC2 (", round(var_explained[2] * 100, 1), "%)")
pc3_label <- paste0("PC3 (", round(var_explained[3] * 100, 1), "%)")
plot_ly(pca_scores,
x = ~PC1,
y = ~PC2,
z = ~PC3,
type = 'scatter3d',
mode = 'markers',
text = rownames(pca_scores),
marker = list(size = 4)
)%>%
layout(
title = "3D Graphic with the first 3 PCs",
scene = list(
xaxis = list(title = pc1_label),
yaxis = list(title = pc2_label),
zaxis = list(title = pc3_label)
)
)